<!-- To compile slides with relative paths, use --> <!-- xaringan::inf_mr(cast_from = '../..') --> <!-- For self contained, just knit with markdown --> <style type="text/css"> /* Apply styles only to the first slide with the title-slide class */ .title-slide { display: flex; justify-content: center; align-items: left; flex-direction: column; height: 800px; text-align: left; } /* Style for the logo in the title slide */ .title-slide img { float: left; margin-right: 20px; width: 100px; } /* Style the title */ .title-slide h1 { color: #15803d; font-size: 40px; font-weight: bold; margin-bottom: 30px; } /* Style the subtitle */ .title-slide h3 { color: #6b7280; font-size: 30px; margin-bottom: 20px; } /* Style the author and date */ .title-slide p { color: #6b7280; font-size: 18px; margin-bottom: 10px; } .inverse .remark-slide-number { display: none; } .remark-slide-number { font-size: 14px; } </style> <!-- <div class="title-slide"> --> <!-- <!-- <img src="../../img/logo.png" alt="Logo" class="title-logo" width="40px"> --> <!-- <img src="logo.png" alt="Logo" class="title-logo" width="40px"> --> <!-- <h1>Data Handling: Import, Cleaning and Visualisation</h1> --> <!-- <h3>Lecture 1: Introduction</h3> --> <!-- <p>Dr. Aurélien Sallin<br>01/10/2023</p> --> <!-- </div> --> .title-slide[ <img src="data:image/png;base64,#../../img/logo.png" width="35%" style="display: block; margin: auto;" /> #Data Handling: Import, Cleaning and Visualisation ### Lecture 1: Introduction Dr. Aurélien Sallin<br>01/10/2023 ] --- ## Welcome to Data Handling 2024! - Go to this app (use the QR code): https://datahandling.shinyapps.io/DataHandlingIntro/ - Use one row to respond to the questions in the column headers (see the first two rows for examples). 
<img src="data:image/png;base64,#../../img/QR_APP_2024.png" width="35%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/ds1.png" width="85%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/ds2.png" width="85%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/ds3.png" width="85%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/ds4.png" width="85%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/ds5.png" width="85%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/ds6.png" width="85%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/ds7.png" width="85%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/ds8.png" width="85%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/ds9.png" width="85%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/ds10.png" width="85%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/ds11.png" width="85%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/ds12.png" width="85%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/ds13.png" width="85%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/ds14.png" width="85%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#../../img/data_science_pipeline.png" width="85%" style="display: block; margin: auto;" /> --- class: center, middle, inverse, Large # Background --- ## 'Data Science'? 
<!-- <br> --> *"This coupling of scientific discovery and practice involves the collection, management, processing, analysis, visualization, and interpretation of vast amounts of heterogeneous data associated with a diverse array of scientific, translational, and inter-disciplinary applications."* University of Michigan 'Data Science Initiative', 2015 --- ## But, what about statistics?! *"Seemingly, statistics is being marginalized here; the implicit message is that statistics is a part of what goes on in data science but not a very big part. At the same time, many of the concrete descriptions of what the DSI will actually do will seem to statisticians to be bread-and-butter statistics. Statistics is apparently the word that dare not speak its name in connection with such an initiative!"* David Donoho (2015). __50 years of Data Science__ --- ## What's new about all this? *"All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: ..."* --- ## What's new about all this? *"All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: <br> procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."* --- ## What's new about all this? 
<img src="data:image/png;base64,#../../img/tukey.jpg" width="35%" style="display: block; margin: auto;" /> .center[.small[John Tukey (_The Future of Data Analysis_, 1962!)]] --- ## Technological change <img src="data:image/png;base64,#../../img/computers.jpg" width="80%" style="display: block; margin: auto;" /> --- ## Relevance for modern economic research <img src="data:image/png;base64,#../../img/css.png" width="80%" style="display: block; margin: auto;" /> --- ## Relevance for modern economic research <img src="data:image/png;base64,#../../img/internet.png" width="90%" style="display: block; margin: auto;" /> --- ## Relevance for modern economic research <img src="data:image/png;base64,#../../img/bigdata.png" width="80%" style="display: block; margin: auto;" /> --- ## Relevance for modern economic research <img src="data:image/png;base64,#../../img/text.png" width="80%" style="display: block; margin: auto;" /> --- ## Data science in Economics skill set <img src="data:image/png;base64,#../../img/venn_diagramm.png" width="50%" style="display: block; margin: auto;" /> --- ## Data science as a life skill <img src="data:image/png;base64,#../../img/datascientistsexy.png" width="80%" style="display: block; margin: auto;" /> --- ## Data science as a life skill "More than anything, what data scientists do is **make discoveries while swimming in data.** ... As they make discoveries, they communicate what they’ve learned and suggest its implications for new business directions. Often they are *creative in displaying information visually and making the patterns they find clear and compelling*... They advise executives and product managers on the implications of the data for *products, processes, and decisions*. What kind of person does all this? *Think of him or her as a hybrid of data hacker, analyst, communicator, and trusted adviser. 
The combination is extremely powerful — and rare.*"

---
class: center, bottom
background-image: url("data:image/png;base64,#../../img/break-picture.png")
background-size: cover

.announcement-style[::Break::]

---
class: inverse, center, middle

# Philosophy of this course

---
class: right, bottom
background-image: url("data:image/png;base64,#https://upload.wikimedia.org/wikipedia/commons/thumb/d/d6/Apprenticeship.jpg/1280px-Apprenticeship.jpg")
background-size: cover

<style type="text/css">
.slide {
  position: relative;
  width: 100%;
  height: 100%;
}
.legend-pic {
  position: absolute;
  bottom: 0;
  right: 0;
  padding: 10px; /* Optional: Add some padding */
  white-space: nowrap;
}
</style>

.slide[.legend-pic[.small[.white[A shoemaker and his apprentice c.1914, Emile Adan]]]]

---

## .green[**At the end of the course, you will be able to...**]

- **Understand the tools you need when working with data** <br> We will use the programming language R, but the principles are similar for any other programming language (👞⚙️)

- **Work independently with data** <br>We will learn how to collect, clean, and analyze data so that you can conduct a data project in Economics (research/consulting/...) from start to finish

- **Ask the right questions of a dataset**<br> We will learn how to ask the right questions of a dataset

- **Learn to communicate about data**<br> We will learn to present our results in a clear and compelling way

---

## My commitment to these goals and to your learning process

- **Transferable skills**

- **Hands-on approach**

- **Emphasis on real-world relevance** <br>(caveat: this course is mandatory for Econ students, so I have limited freedom in the syllabus)

- **As much fun as possible (as coding can be fun...😎)**

---

## Your commitment to the course

- Prepare with the readings, attend the lecture, and recap key concepts in the lecture notes (self-study)

- Work on the exercises, come to the exercise sessions, and tackle the tricky exercises together!

- Code, code, and code. Repeat...

``` r
# Count attempts up to 999, then report success
try <- 0
while (try < 999) {
  try <- try + 1
}
cat("success!")
```

```
## success!
```

---

## Our Team - At Your Service

<div class="custom-table">

|                      |                             |                             |
|----------------------|-----------------------------|-----------------------------|
| <img src="data:image/png;base64,#../../img/fede.jpg" height="150"/> | <img src="data:image/png;base64,#../../img/andreaburro.jpg" height="150"/> | <img src="data:image/png;base64,#../../img/aureliensallin.jpg" height="150"/> |
| Federica Mascolo | Andrea Burro | Aurélien Sallin |

</div>

<style>
.custom-table table th {
  width: 20%;
  border-top: 0px;
  border-bottom: 0px;
}
.custom-table table thead th {
  border-top: 0px;
  border-bottom: 0px;
}
.custom-table thead th, .custom-table tr:nth-child(1) {
  background-color: white;
}
</style>

---

## Introduction: Aurélien Sallin

- 2022-today: Expert in Health Care Research and Member of Management, SWICA Health Organization, Winterthur
- 2022-today: Lecturer, HSG
- 2018-2022: PhD in Economics and Finance, HSG

<br>
<br>
<br>

<!-- Previously: -->
<img src="data:image/png;base64,#../../img/gsp.png" height="90"/>
<img src="data:image/png;base64,#../../img/unifr.png" height="90"/>
<img src="data:image/png;base64,#../../img/logo.png" height="80"/>
<img src="data:image/png;base64,#../../img/swica-logo-e.svg" height="70"/>

---

## Introduction: Aurélien Sallin

.green[Research at SWICA]

- Use Real-World Data from claims to assess the effectiveness of health technology tools
- Use (Causal) Machine Learning to evaluate the effect of health policies on doctors' prescription behaviors
- Develop financing models for mandatory health care in Switzerland

<br>

.green[Other Research in Economics of Education (during my PhD in Economics and Finance)]

- Misclassification rates for gifted students
- Evaluation of Special Education programs

---
class: inverse, center, middle

# Organisation of the Course

---

## Course concept: lectures

- Lectures (Thursday morning)
  - Background/Concepts
  - Illustration of concepts
  - Illustration of 'hands-on' approaches

---

## Course concept: exercises

- Exercise sheets (handed out every other week)
  - Some conceptual questions
  - Hands-on exercises/tutorials in R
  - *The first exercise sheet (set up R/RStudio) is available on StudyNet/Canvas today*

---

## Course concept: exercise sessions

- In-class exercise sessions (bi-weekly evening sessions)
  - Discussion of exercises and additional input with Federica and Andrea
  - Recap of concepts
  - Q&A, support
  - Time for more coding!

---
class: inverse, center, middle

<!-- background-image: url("../../img/roadahead.jpg") -->
<!-- background-size: cover -->

# The road ahead

---

## Two special lectures

- .green[**24.10.2024**: R from a student's perspective]
  - Minna Heim, BA from St. Gallen, student in Data Science at ETH Zurich

<br>

- .green[**05.12.2024**: Industry and Consulting Insights]
  - Rachel Lund, PhD: Data Science Lead at Deloitte

---

## Part I: Data (Science) fundamentals

<table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Topic </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: green !important;"> 19.09.2024 </td> <td style="text-align:left;font-weight: bold;color: green !important;"> Introduction: Big Data/Data Science, course overview </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: green !important;"> 26.09.2024 </td> <td style="text-align:left;font-weight: bold;color: green !important;"> Programming with R </td> </tr> <tr> <td style="text-align:left;border-bottom: 20px solid white;"> 26.09.2024 </td> <td style="text-align:left;border-bottom: 20px solid white;"> Exercises 1: Tools, programming </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: green !important;"> 03.10.2024 </td> <td style="text-align:left;font-weight: bold;color: green !important;"> An introduction to data and data processing </td> </tr> <tr> <td 
style="text-align:left;font-weight: bold;color: green !important;"> 10.10.2024 </td> <td style="text-align:left;font-weight: bold;color: green !important;"> Data storage and data structures </td> </tr> <tr> <td style="text-align:left;border-bottom: 20px solid white;"> 10.10.2024 </td> <td style="text-align:left;border-bottom: 20px solid white;"> Exercises/Workshop 2: Data storage and data structures </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: green !important;"> 17.10.2024 </td> <td style="text-align:left;font-weight: bold;color: green !important;"> Rectangular data </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: green !important;"> 24.10.2024 </td> <td style="text-align:left;font-weight: bold;color: green !important;"> Non-rectangular data. Guest spot: Minna Heim </td> </tr> <tr> <td style="text-align:left;"> 24.10.2024 </td> <td style="text-align:left;"> Exercises/Workshop 3: Web data, text, and images </td> </tr> </tbody> </table> --- ## Part II: Data gathering and preparation <style type="text/css"> .remark-slide table { width: 100%; border-top: 0px; border-bottom: 0px; } .remark-slide table thead th { border-top: 0px; border-bottom: 0px; } .remark-slide thead, .remark-slide tr:nth-child(even){ background-color: white; } .remark-slide thead{ background-color: grey; } table{ border-collapse: collapse; } .remark-slide thead:empty { display: none; } </style> <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Topic </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: green !important;"> 14.11.2024 </td> <td style="text-align:left;font-weight: bold;color: green !important;"> Data preparation and manipulation </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: green !important;"> 21.11.2024 </td> <td style="text-align:left;font-weight: bold;color: green !important;"> Basic 
statistics and data analysis with R </td> </tr> <tr> <td style="text-align:left;border-bottom: 25px solid white;"> 21.11.2024 </td> <td style="text-align:left;border-bottom: 25px solid white;"> Exercises/Workshop 4: Data gathering, data import </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: green !important;border-bottom: 25px solid white;"> 28.11.2024 </td> <td style="text-align:left;font-weight: bold;color: green !important;border-bottom: 25px solid white;"> Visualisation </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: green !important;border-bottom: 25px solid white;"> 05.12.2024 </td> <td style="text-align:left;font-weight: bold;color: green !important;border-bottom: 25px solid white;"> Guest Lecture: Data Handling @Deloitte (Rachel Lund, Senior Economist) </td> </tr> <tr> <td style="text-align:left;"> 05.12.2024 </td> <td style="text-align:left;"> Exercises/Workshop 5: Data preparation and applied data analysis with R </td> </tr> </tbody> </table> --- ## Part III: Analysis, visualisation, output <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Topic </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: green !important;"> 12.12.2024 </td> <td style="text-align:left;font-weight: bold;color: green !important;"> Analytics, more visualisation, and data products </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: green !important;border-bottom: 25px solid white;"> 19.12.2024 </td> <td style="text-align:left;font-weight: bold;color: green !important;border-bottom: 25px solid white;"> Summary, Wrap-up, Final workshop </td> </tr> <tr> <td style="text-align:left;border-bottom: 25px solid white;"> 19.12.2024 </td> <td style="text-align:left;border-bottom: 25px solid white;"> Exercises/Workshop 6: Visualization, dynamic documents </td> </tr> <tr> <td 
style="text-align:left;font-weight: bold;color: green !important;"> 19.12.2024 </td> <td style="text-align:left;font-weight: bold;color: green !important;"> Exam for Exchange Students </td> </tr> </tbody> </table>

---

## Exam information

- Central, written examination: *digital, BYOD!*
- Multiple choice questions.
- A few open questions.
- Theoretical concepts and practical applications in R (questions based on code examples).

---

## Exam information II

- We will release sample multiple choice questions via Quizzes on StudyNet/Canvas (exact same format and style as the exam questions).
- Exchange students who need to take the exam before the central exam block:
  - Date, time, place: *19.12.2024, 16:15-18:00, room tbd*.
  - Questions: *andrea.burro@unisg.ch*

---
class: inverse, center, middle

# The tools

---

## Core course resources

- All information and materials (notes, slides, course sheet, syllabus, etc.) are available on StudyNet/Canvas.
- Use GitHub to stay up to date with the course material
  - Install git on your computer as explained [here](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
  - Clone the course repository using

.code-bg-red[
``` bash
git clone git@github.com:ASallin/datahandling-lecture.git # to clone
git pull origin main # to update
```
]

<style type="text/css">
.code-bg-red .remark-code, .code-bg-red .remark-code * {
  background-color:#f8f8f8!important;
  font-size: 19px;
  padding: .5em;
}
</style>

---

# Why R?

.green[***The* data language**]

- Widely used in Data Science jobs.
- Originally designed as a tool for statistical analysis.
- Particularly useful to program with data.

.green[**High-level language**]

- Relatively easy to learn.
- Many free tutorials and plenty of support available online.

.green[**Free, open-source, large community** ]

- Used in various fields.
- Thousands of 'R-packages' covering diverse aspects of data analysis.
- Learn from open-source code.

---

## R

<img src="data:image/png;base64,#../../img/R_logo.svg.png" width="40%" style="display: block; margin: auto;" />

Install R from [here](https://stat.ethz.ch/CRAN/)!

---

## RStudio

<img src="data:image/png;base64,#../../img/rstudio.png" width="40%" style="display: block; margin: auto;" />

Install RStudio from [here](https://www.rstudio.com/products/rstudio/download/#download)!

---

## Main textbooks

[Data Handling Pocket Reference](https://umatter.github.io/datahandling/)

[Murrell, Paul (2009). *Introduction to Data Technologies*, London: Chapman & Hall/CRC.](https://www.stat.auckland.ac.nz/~paul/ItDT/)

[Wickham, Hadley and Garrett Grolemund (2017). *R for Data Science*, 1st Edition. Sebastopol, CA: O’Reilly.](http://r4ds.had.co.nz/)

[Baumer, Kaplan and Horton (2023). *Modern Data Science with R*, 2nd Edition.](https://mdsr-book.github.io/mdsr3e/)

---

## Further resources

- [Stack Overflow](https://stackoverflow.com/questions)
- [Get inspired in the R blogosphere](https://www.r-bloggers.com)
- ChatGPT

---

## And now this...